Project - Credit Card Users Churn Prediction


Context:

Objective:

Data Information

The records contain the customer's personal information together with their credit card account details and usage patterns, such as credit limit, revolving balance, transaction amounts and counts, and contact and inactivity history with the bank.

The detailed data dictionary is given below:

Customer Details


Table of Contents (TOC)

- Importing Packages
- Unwrapping Customer Information
- Data Pre-Processing & Sanity Checks
- Summary of Data Analysis
- EDA Analysis
- Model Building
- Model Analysis - Original Data
- Model Analysis - Oversampling data
- Model Analysis - Undersampling data
- Comparison Models with Data - Original vs Oversample vs Undersample
- Model Performance on Test dataset
- Pipelines for productionizing the model
- Recommendations

Importing required Packages:




Unwrapping the Customer Information:



Data Description:


Data Preprocessing & Sanity Checks



Observations:

Dropping the Customer ID Column

Checking for Duplicates

Checking for Columns with missing values
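The three sanity checks above can be sketched with pandas as follows. This is only an illustration: the identifier column name `CLIENTNUM` and the sample values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Tiny illustrative frame; the ID column name and all values are made up.
df = pd.DataFrame({
    "CLIENTNUM": [101, 102, 103, 104],
    "Customer_Age": [45, 52, 45, 38],
    "Education_Level": ["Graduate", None, "Graduate", "High School"],
    "Marital_Status": ["Married", "Single", "Married", None],
})

# Drop the identifier: it is unique per row and carries no predictive signal.
df = df.drop(columns=["CLIENTNUM"])

# Count fully duplicated rows (only meaningful once the ID is gone).
n_duplicates = int(df.duplicated().sum())

# Missing values per column, keeping only the columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0]

print(n_duplicates)
print(missing.to_dict())
```

Checking duplicates after dropping the ID matters, because a unique ID would otherwise hide rows that are identical in every real feature.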

Observations:

Validating the values of the columns to observe the pattern and data correctness

Observations:


Inferences:

Replacing the incorrect value in Income_Category

Converting the Attrition_Flag text to binary
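A minimal sketch of both replacements, under stated assumptions: the invalid label `"abc"` and the class labels `"Existing Customer"` / `"Attrited Customer"` are placeholders for whatever the value check actually surfaced, and churned customers are assumed to be the positive class.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Income_Category": ["Less than $40K", "abc", "$60K - $80K"],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
})

# "abc" stands in for the invalid label (assumed); mark it as missing so it
# can be imputed later instead of being treated as a real category.
df["Income_Category"] = df["Income_Category"].replace("abc", np.nan)

# Assumed encoding: attrited (churned) customers are the positive class, 1.
flag_map = {"Existing Customer": 0, "Attrited Customer": 1}
df["Attrition_Flag"] = df["Attrition_Flag"].map(flag_map).astype(int)

print(df)
```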


Summary of Data Analysis


Data Structure:

Data Cleaning:

Data Description:



Common Functions


EDA Analysis - Analyzing respective attributes to understand the data pattern



Analyzing the count and percentage of Categorical attributes using a bar chart
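The numbers behind such a bar chart can be computed as sketched below. The helper name `count_and_percent` and the sample values are hypothetical; only the idea of pairing counts with percentages comes from the heading above.

```python
import pandas as pd

def count_and_percent(series: pd.Series) -> pd.DataFrame:
    """Return per-category counts and percentages, including missing values."""
    counts = series.value_counts(dropna=False)
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})

# Made-up sample; in the notebook this would be a column such as df["Gender"].
gender = pd.Series(["F", "M", "F", "F"], name="Gender")
summary = count_and_percent(gender)
print(summary)
```

With matplotlib installed, `summary["count"].plot.bar()` would draw the chart, and the `percent` column can be used to annotate the bars.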

Insights from Categorical Data


Observations:


Analyzing the Numerical attributes using Histogram and Box Plots

Insights from Numerical Data


Observations:

Univariate Analysis


Analyzing the Age of the Customers

Observations:

Analyzing the length of the relationship with the bank (in months)

Observations:

Analyzing the Credit Limit

Analyzing the Total_Revolving_Bal

Observations:

Analyzing the Avg_Open_To_Buy

Observations:

Observations:

Analyzing the Total_Amt_Chng_Q4_Q1

Observations:

Analyzing the Total_Ct_Chng_Q4_Q1

Observations:

Analyzing the Total_Trans_Amt

Observations:

Analyzing the Total_Trans_Ct

Observations:

Analyzing the Avg_Utilization_Ratio

Observations:

Data Description Post Treatment


Bivariate Analysis


Visualising each variable's association with Attrition_Flag and the correlations between variables

Analyzing the Categorical attributes with Attrition Flag

Observation:


Attrition_Flag vs Gender

Attrition_Flag vs Dependent_count

Attrition_Flag vs Education_Level

Attrition_Flag vs Marital_Status

Attrition_Flag vs Income_Category

Attrition_Flag vs Card_Category

Attrition_Flag vs Total_Relationship_Count

Attrition_Flag vs Months_Inactive_12_mon

Attrition_Flag vs Contacts_Count_12_mon

Attrition_Flag vs Age_Group

Attrition_Flag vs Months_on_book_Grp

Attrition_Flag vs Credit_Limit_Grp
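The grouped columns used above (Age_Group, Months_on_book_Grp, Credit_Limit_Grp) are presumably derived by binning the numeric columns. A sketch for Age_Group with `pd.cut` follows; the bin edges, labels, and sample rows are illustrative assumptions, not the notebook's actual values.

```python
import pandas as pd

# Made-up rows; Attrition_Flag is assumed to be 1 for attrited customers.
df = pd.DataFrame({
    "Customer_Age": [26, 35, 44, 58, 67],
    "Attrition_Flag": [0, 0, 1, 0, 1],
})

# Assumed bin edges and labels; intervals are right-closed, e.g. (20, 30].
age_bins = [20, 30, 40, 50, 60, 80]
age_labels = ["21-30", "31-40", "41-50", "51-60", "61+"]
df["Age_Group"] = pd.cut(df["Customer_Age"], bins=age_bins, labels=age_labels)

# Share of existing (0) vs attrited (1) customers within each age group.
rate = pd.crosstab(df["Age_Group"], df["Attrition_Flag"], normalize="index")
print(rate)
```

The same pattern would apply to Months_on_book and Credit_Limit with their own edges.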

Analyzing the Numerical attributes with Attrition Flag

Observation:


Attrition_Flag vs Customer Age

Attrition_Flag vs Months on Book

Attrition_Flag vs Total Revolving Bal

Attrition_Flag vs Total Trans Amount vs Total Trans Ct

Attrition_Flag vs Credit Limit

Attrition_Flag vs Avg. Open to Buy

Attrition_Flag vs Total Amt Change Q4-Q1 vs Total Ct Change Q4-Q1

Attrition_Flag vs Avg. Utilization Ratio

Multivariate Analysis - Visualise association with Attrition Flag & correlation between other Features



Observations:


Card_Category vs Income_Category

Observations:

Card_Category vs Income_Category

Observations:

Total_Trans_Amt vs Income_Category vs Attrition_Flag

Observations:

Income Category vs Credit_Limit vs Customer Age vs Attrition_Flag

Observations:

Income Category vs Customer_Age vs Total_Trans_Amt vs Attrition_Flag

Total_Trans_Amt vs Card_Category vs Attrition_Flag

Observations


Model Building


Data Preparation for Modeling


Split Data

Missing Value Treatment
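Treating missing values after the split keeps test-set information out of the imputer. A sketch under assumptions: the 80/20 stratified split, the most-frequent strategy, and the synthetic data are all illustrative choices, not confirmed settings from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 10% missing marital status.
rng = np.random.default_rng(0)
n = 200
status = rng.choice(["Married", "Single"], size=n).astype(object)
status[rng.random(n) < 0.1] = np.nan
X = pd.DataFrame({
    "Customer_Age": rng.integers(25, 70, size=n),
    "Marital_Status": status,
})
y = pd.Series([1] * 32 + [0] * 168, name="Attrition_Flag")

# Stratify so the minority (attrited) share is similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the imputer on the training data only, then reuse it on the test data.
imputer = SimpleImputer(strategy="most_frequent")
X_train["Marital_Status"] = imputer.fit_transform(X_train[["Marital_Status"]]).ravel()
X_test["Marital_Status"] = imputer.transform(X_test[["Marital_Status"]]).ravel()
```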


Building the model


Model evaluation criterion:

The model can make wrong predictions in two ways:

  1. Predicting that a customer will stay with the credit card service when the customer actually leaves - Loss of resources
  2. Predicting that a customer will leave the credit card service when the customer actually stays - Loss of opportunity

Which case is more important?

How to reduce this loss, i.e., how to reduce False Negatives?
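Reducing False Negatives corresponds to maximising Recall, defined as TP / (TP + FN). The toy example below assumes attrition is encoded as 1; the labels and predictions are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = customer attrited, 0 = customer stayed (assumed encoding).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = tp / (tp + fn): the share of actual churners the model catches.
recall = recall_score(y_true, y_pred)
print(tn, fp, fn, tp, recall)
```

Here one churner out of four is missed, so Recall is 3 / 4 = 0.75 even though two loyal customers were also flagged.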

Model Analysis - Original data


Oversampling training data using SMOTE
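To show what SMOTE does, here is a simplified sketch of its core idea: each synthetic minority point is an interpolation between a minority sample and one of its nearest minority neighbours. This is not the implementation used in the notebook (most likely the `SMOTE` class from imbalanced-learn); the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X, y, minority_label=1, k=5, random_state=0):
    """Simplified SMOTE for a binary target with at least 2 minority rows."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    # Pick a random base point and one of its neighbours for each new sample.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_new)]
    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority_label)])
    return X_out, y_out

# Made-up imbalanced training data: 16 attrited vs 84 existing customers.
X_train = np.random.default_rng(1).normal(size=(100, 2))
y_train = np.array([1] * 16 + [0] * 84)
X_over, y_over = smote_sketch(X_train, y_train)
print((y_over == 1).sum(), (y_over == 0).sum())
```

As the heading says, oversampling is applied to the training data only, so that the test set keeps the real class balance.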


Model Analysis - Oversampling data


Undersampling training data using Random Undersampler
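Random undersampling drops rows from the majority class until the classes are balanced. The sketch below illustrates the idea with NumPy only; it is not the notebook's code, and the function name `random_undersample` is hypothetical (imbalanced-learn's `RandomUnderSampler` is the usual library choice).

```python
import numpy as np

def random_undersample(X, y, random_state=0):
    """Randomly keep as many rows per class as the smallest class has."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Made-up imbalanced training data: 16 attrited vs 84 existing customers.
X_train = np.arange(200).reshape(100, 2)
y_train = np.array([1] * 16 + [0] * 84)
X_under, y_under = random_undersample(X_train, y_train)
print(len(y_under), int(y_under.sum()))
```

The trade-off is that most majority-class rows are discarded, which is why the results are compared against the original and oversampled sets.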

Model Analysis - Undersampling data


Comparing Models - Original vs Oversampled vs Undersampled Data - 21 models

Picking the 3 best models from the 7 x 3 Matrix (Regular Set, Over Sampling Set & Under Sampling Set)


Adaboost

Grid Search

Checking model performance

Observations:

Randomized Search
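The difference between the two tuning strategies can be sketched as follows: grid search evaluates every combination, while randomized search samples a fixed number of them. The parameter grid, `cv=3`, `n_iter=3`, and the synthetic data are assumptions for illustration; scoring on Recall follows the evaluation criterion stated earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.84], random_state=1
)

# Assumed search space; the notebook's actual grid is not shown.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.1, 0.5, 1.0]}

# Grid search: fits all 2 x 3 = 6 combinations.
grid = GridSearchCV(
    AdaBoostClassifier(random_state=1), param_grid, scoring="recall", cv=3
)
grid.fit(X, y)

# Randomized search: fits only n_iter sampled combinations.
rand = RandomizedSearchCV(
    AdaBoostClassifier(random_state=1), param_grid,
    n_iter=3, scoring="recall", cv=3, random_state=1,
)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```

Randomized search is cheaper on large grids, at the cost of possibly missing the best combination.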

Checking model performance

Observations:

Gradient Boost

Grid Search

Checking model performance

Observations:

Randomized Search

Checking model performance

Observations:

XGBoost

Grid Search

Observations:

Randomized Search

Observations:

Comparing all models

Observations:


Model Performance on Test dataset

Comparing the Training vs Testing data

Observations:

- XGBoost Grid & XGBoost Random: Accuracy drops significantly from Train to Test data, which indicates overfitting, even though these models have the highest Recall scores
- Gradient Grid: The model overfits the training data
- Gradient Random: The model does not generalize well to the Test data and is close to overfitting
- AdaBoost Grid & AdaBoost Random: The models generalize well from Train to Test data, and Accuracy is also good. Taking Accuracy into account alongside Recall, AdaBoost Random is considered the better model


Based on the observations above, AdaBoost Random is selected as the best model: it does not overfit and has good accuracy compared with the other models. It is used for the rest of the analysis.
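A train-versus-test comparison of this kind can be produced with a small helper like the one below. The function name `train_test_report`, the synthetic data, and the default AdaBoost settings are assumptions; a large positive gap between train and test scores is the overfitting signal discussed in the observations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

def train_test_report(model, X_train, y_train, X_test, y_test):
    """Accuracy and recall on train and test, plus the train-minus-test gap."""
    rows = {}
    for name, (X, y) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
        pred = model.predict(X)
        rows[name] = {
            "accuracy": accuracy_score(y, pred),
            "recall": recall_score(y, pred),
        }
    report = pd.DataFrame(rows).T
    report.loc["gap"] = report.loc["train"] - report.loc["test"]
    return report

# Synthetic imbalanced data standing in for the churn dataset.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.84], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
model = AdaBoostClassifier(random_state=1).fit(X_train, y_train)
report = train_test_report(model, X_train, y_train, X_test, y_test)
print(report)
```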


Observations

Pipelines for productionizing the model
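A production pipeline bundles preprocessing and the chosen model so that raw customer rows can be scored in one call. The sketch below is an assumption about its shape: the selected columns, imputation strategies, sample rows, and default AdaBoost hyperparameters are illustrative, whereas the tuned AdaBoost Random model would be used in practice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

num_cols = ["Total_Trans_Ct", "Total_Revolving_Bal"]
cat_cols = ["Income_Category"]

# Numeric columns: median imputation. Categorical: impute, then one-hot encode.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", AdaBoostClassifier(random_state=1)),  # default settings, assumed
])

# Made-up rows; Attrition_Flag is assumed to be 1 for attrited customers.
X = pd.DataFrame({
    "Total_Trans_Ct": [20, 85, 30, 90, 25, 70, np.nan, 95],
    "Total_Revolving_Bal": [0, 1500, 200, 1800, 0, 1200, 100, 2000],
    "Income_Category": ["Less than $40K", "$60K - $80K", np.nan, "$80K - $120K",
                        "Less than $40K", "$60K - $80K", "Less than $40K", "$80K - $120K"],
})
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

model.fit(X, y)
pred = model.predict(X)

# A new customer with a category unseen in training is still scored.
new_customer = pd.DataFrame({
    "Total_Trans_Ct": [40],
    "Total_Revolving_Bal": [500],
    "Income_Category": ["$120K +"],
})
new_pred = model.predict(new_customer)
```

Setting `handle_unknown="ignore"` keeps the pipeline from failing when production data contains a category that was absent during training.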


Recommendations:


Based on the Customer Information:

Based on the Attrition data of the Customers, we found the following insights that can be leveraged as recommendations for understanding the Customers:

